A Dynamic Graph Interactive Framework with Label-Semantic Injection for Spoken Language Understanding
Joint models for multi-intent detection and slot filling are gaining traction
because they better reflect complicated real-world scenarios. However, existing
approaches (1) focus on identifying implicit correlations between utterances
and one-hot encoded labels in both tasks while ignoring explicit label
characteristics; and (2) directly incorporate multi-intent information for each
token, which can lead to incorrect slot predictions due to the introduction of
irrelevant intents. In this paper, we propose a framework termed
DGIF, which first leverages the semantic information of labels to give the
model additional signals and enriched priors. Then, a multi-grain interactive
graph is constructed to model correlations between intents and slots.
Specifically, we propose a novel approach to construct the interactive graph
based on the injection of label semantics, which can automatically update the
graph to better alleviate error propagation. Experimental results show that our
framework significantly outperforms existing approaches, obtaining a relative
improvement of 13.7% in overall accuracy over the previous best model on the
MixATIS dataset.
Comment: Submitted to ICASSP 202
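
As a rough, hypothetical illustration of the label-semantic injection idea described above (not the authors' implementation; the class name, shapes, and attention scheme are all assumptions), here is a minimal PyTorch sketch in which label embeddings are attended over by token features, so the model receives explicit label semantics rather than only one-hot targets:

```python
# Hypothetical sketch of label-semantic injection (not the authors' code).
import torch
import torch.nn as nn

class LabelSemanticInjection(nn.Module):
    def __init__(self, hidden: int, num_labels: int):
        super().__init__()
        # One semantic vector per label; a full version would initialize these
        # from encoded label descriptions instead of random weights.
        self.label_emb = nn.Embedding(num_labels, hidden)
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, hidden) encoder states.
        labels = self.label_emb.weight                   # (num_labels, hidden)
        attn = torch.softmax(tokens @ labels.T, dim=-1)  # token-to-label attention
        injected = attn @ labels                         # label-aware signal per token
        return tokens + self.proj(injected)              # enriched token features

# Usage with dummy encoder states.
inject = LabelSemanticInjection(hidden=256, num_labels=120)
out = inject(torch.randn(2, 16, 256))                    # (2, 16, 256)
```

In the framework described in the abstract, such enriched token features would then feed the multi-grain interactive graph that models intent-slot correlations.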
Video Referring Expression Comprehension via Transformer with Content-conditioned Query
Video Referring Expression Comprehension (REC) aims to localize a target
object in videos based on the queried natural language. Recent improvements in
video REC have been made using Transformer-based methods with learnable
queries. However, we contend that this naive query design is not ideal given
the open-world nature of video REC introduced by text supervision. With numerous
potential semantic categories, relying on only a few slowly updated queries is
insufficient to characterize them all. Our solution is to create
dynamic queries that are conditioned on both the input video and language to
model the diverse objects referred to. Specifically, we place a fixed number of
learnable bounding boxes throughout the frame and use corresponding region
features to provide prior information. We also observe that current query
features overlook the importance of cross-modal alignment. To address this, we
align specific phrases in the sentence with semantically relevant visual areas,
annotating them in existing video datasets (VID-Sentence and VidSTG). By
incorporating these two designs, our proposed model (called ConFormer)
outperforms other models on widely benchmarked datasets. For example, in the
testing split of the VID-Sentence dataset, ConFormer achieves an 8.75% absolute
improvement on Accu.@0.6 compared to the previous state-of-the-art model.
Comment: Accepted to ACM International Conference on Multimedia Workshop (ACM
MM), 2023. arXiv admin note: substantial text overlap with arXiv:2210.0295
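
As a loose sketch of the content-conditioned query design described above (an assumption-laden illustration, not ConFormer's released code; ContentConditionedQueries and every parameter name are hypothetical), the following PyTorch snippet places learnable boxes over a frame, pools their region features, and conditions each query on both the pooled visual prior and a sentence embedding:

```python
# Hypothetical sketch of content-conditioned queries (not ConFormer's code).
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ContentConditionedQueries(nn.Module):
    def __init__(self, dim: int, num_queries: int):
        super().__init__()
        # Learnable normalized boxes (cx, cy, w, h) placed throughout the frame.
        self.boxes = nn.Parameter(torch.rand(num_queries, 4))
        self.query_proj = nn.Linear(2 * dim, dim)

    def forward(self, feat: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) frame features; text: (B, C) sentence embedding.
        B, C, H, W = feat.shape
        cx, cy, w, h = self.boxes.sigmoid().unbind(-1)
        # Convert normalized centers/sizes to absolute (x1, y1, x2, y2) corners.
        corners = torch.stack([(cx - w / 2) * W, (cy - h / 2) * H,
                               (cx + w / 2) * W, (cy + h / 2) * H], dim=-1)
        # Pool one feature vector per learnable box as the region prior.
        regions = roi_align(feat, [corners] * B, output_size=1)  # (B*Q, C, 1, 1)
        regions = regions.view(B, -1, C)                         # (B, Q, C)
        # Condition every region prior on the language embedding.
        cond = torch.cat([regions, text[:, None].expand_as(regions)], dim=-1)
        return self.query_proj(cond)                             # (B, Q, C) dynamic queries

# Usage with dummy inputs.
model = ContentConditionedQueries(dim=256, num_queries=16)
queries = model(torch.randn(2, 256, 20, 20), torch.randn(2, 256))
```

In the abstract's framing, such dynamic queries would replace the purely learnable query embeddings fed to a Transformer decoder, so the queries reflect the objects actually referred to by the text.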